Abstract. This paper studies parallelization schemes for stochastic Vector
Quantization algorithms in order to obtain speed-ups using distributed
resources. We show that the most intuitive parallelization scheme does
not lead to better performance than the sequential algorithm. Another
distributed scheme is therefore introduced, which obtains the expected
speed-ups. It is then adapted to distributed architectures where
communications are slow and inter-machine synchronization is too
costly. The schemes are tested on simulated distributed architectures
and, for the last one, on the Microsoft Windows Azure platform, obtaining
speed-ups with up to 32 Virtual Machines.

1 Introduction 

Motivated by the problem of executing clustering algorithms on very large
datasets, we investigate parallelization schemes of the stochastic Vector
Quantization (VQ) method (also called online k-means). This procedure is known
for its good statistical properties but it does not exhibit the embarrassing
parallelism of the (batch) k-means. Given a satisfactory sequential
implementation of the VQ algorithm, we aim at speeding up its execution through
a parallel implementation: the ultimate goal is to reduce the wall time used by
the method on a given dataset, that is, the time needed to reach some
performance threshold, using more than one computing unit. Theoretical parallel
VQ algorithms are studied in [1]. The aim of the present paper is to derive
actual real-world implementations.

The VQ technique computes a summary of a dataset $\{z_t\}_{t=1}^{n}$ of
$d$-dimensional samples with $\kappa$ prototypes,
$w = (w_1, \dots, w_\kappa) \in (\mathbb{R}^d)^\kappa$. Starting from a random
initial $w(0) \in (\mathbb{R}^d)^\kappa$ and given a series of positive steps
$(\varepsilon_t)_{t>0}$, VQ produces a series of $w(t)$ by updating $w$
prototype by prototype. More precisely, with
$l(t) = \arg\min_{l=1,\dots,\kappa} \| z_{\{t+1 \bmod n\}} - w(t)_l \|^2$,
we have

\[
w(t+1)_i =
\begin{cases}
w(t)_i & \text{when } i \neq l(t), \\
w(t)_i - \varepsilon_{t+1}\bigl(w(t)_i - z_{\{t+1 \bmod n\}}\bigr) & \text{when } i = l(t),
\end{cases}
\tag{1}
\]

where the mod operator stands for the remainder of an integer division operation. 
A theorem about almost sure convergence of the VQ procedure is proved in [2]. 
It is well known that the VQ algorithm belongs to the class of stochastic gradient 
descent algorithms (see [3] for instance). 
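
As an illustration, the sequential update (1) can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `nearest`, `vq_step` and `sequential_vq` are hypothetical, samples and prototypes are plain lists of floats, and the dataset is indexed from 0 so that `data[(t + 1) % n]` plays the role of $z_{\{t+1 \bmod n\}}$.

```python
def nearest(w, z):
    """Index of the prototype in w closest to sample z (squared Euclidean distance)."""
    return min(range(len(w)),
               key=lambda l: sum((a - b) ** 2 for a, b in zip(w[l], z)))


def vq_step(w, z, eps):
    """One VQ update, in place: only the winning prototype moves towards z."""
    l = nearest(w, z)
    w[l] = [a - eps * (a - b) for a, b in zip(w[l], z)]
    return l


def sequential_vq(data, w0, steps, n_iter):
    """Run n_iter iterations of (1); steps(t) returns the positive step eps_t, t >= 1."""
    w = [list(p) for p in w0]      # copy so that the caller's prototypes are untouched
    n = len(data)
    for t in range(n_iter):
        z = data[(t + 1) % n]      # sample z_{t+1 mod n}, with 0-based storage
        vq_step(w, z, steps(t + 1))
    return w
```

A call such as `sequential_vq(data, w0, lambda t: 1.0 / t, 1000)` uses a decreasing step sequence; the choice $1/t$ is only an example and is not taken from the paper.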



This paper follows the VQ ideas presented in [1]. We assume having access to
$M$ computing entities, each of them executing concurrent VQ procedures. These
executions are performed on a dataset, split among the local memory of the
computing instances, and represented by the sequences $\{z^i_t\}_{t=1}^{n}$,
$i \in \{1, \dots, M\}$. The prototype iterations computed by the VQ techniques
on each node are denoted by $\{w^i(t)\}_{t \geq 0}$ and called versions. We use
the following normalized criterion to measure the speed-up ability of our
investigated schemes:

\[
C_{n,M}(w) = \frac{1}{nM} \sum_{i=1}^{M} \sum_{t=1}^{n}
\min_{l=1,\dots,\kappa} \| z^i_t - w_l \|^2,
\qquad w \in (\mathbb{R}^d)^\kappa.
\tag{2}
\]
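
The normalized criterion can be evaluated directly from its definition. The sketch below reads it as the squared distance to the nearest prototype, averaged over all samples of all computing entities; this reading, and the names `quantization_error` and `shards`, are assumptions made for illustration.

```python
def quantization_error(shards, w):
    """Mean, over every sample of every shard, of the squared distance
    to the closest prototype in w.

    shards: list of M datasets (one per computing entity), each a list of samples.
    w:      list of prototypes; samples and prototypes are lists of floats.
    """
    total, count = 0.0, 0
    for shard in shards:           # i = 1..M
        for z in shard:            # t = 1..n
            total += min(sum((a - b) ** 2 for a, b in zip(p, z)) for p in w)
            count += 1
    return total / count           # count equals n * M when all shards have n samples
```

For instance, `quantization_error([[[0.0], [2.0]], [[4.0], [6.0]]], [[1.0], [5.0]])` returns `1.0`, since every sample lies at distance 1 from its nearest prototype.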

The rest of the paper is organized as follows. First, Section 2 provides
empirical evidence that the simplest scheme cannot bring speed-ups. Then,
some insights explaining this unsatisfactory situation are provided in
Section 3. Consequently, we design a new scheme and show in practice its
ability to bring speed-ups. Finally, in Section 4, we present an asynchronous
adaptation of this latter scheme which better fits slow communication
architectures such as Cloud Computing.

Notice that the proposed algorithms are tested using a simulated distributed
architecture and synthetic vector data,¹ but our conclusions are more sensitive
to the loss function smoothness and convexity than to the data choice.

2 A first distributed scheme 

Our investigation starts with the most intuitive parallelization scheme. Each
computing resource starts with the same initial prototypes (a.k.a. versions):
$w^1(0) = \dots = w^M(0)$. Then each machine applies the sequential VQ to its
subset of the dataset. Once in a while, prototypes are synchronized: when
$\tau$ data points have been processed by each concurrent processor, a shared
version of the prototypes is computed as follows (here for the first
synchronization event):

\[
w^{srd} = \frac{1}{M} \sum_{j=1}^{M} w^j(\tau).
\tag{3}
\]

The shared version is then broadcast to each processing unit. In the case of a
smooth convex loss function, distributed stochastic gradient descent algorithms
with averaging of local results provide a speed-up over the sequential
algorithm (see [3]). Figure 1 shows a typical evolution through wall time of the
quantization error, obtained with an execution of the scheme on a simulated
parallel implementation in which communications are instantaneous. It shows
that in our case, where the loss function is neither smooth nor convex, multiple
resources do not bring speed-ups for convergence. Even if more data are
processed, no gain in terms of wall clock time is provided by this parallel
scheme.
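
The averaging scheme of this section can be simulated on a single machine, with instantaneous communications, as sketched below. This is an illustrative sketch only: `local_vq` and `averaging_scheme` are hypothetical names, the concurrent units are run one after the other, and every unit uses the same step sequence indexed by the common iteration counter.

```python
def nearest(w, z):
    """Index of the prototype in w closest to sample z."""
    return min(range(len(w)),
               key=lambda l: sum((a - b) ** 2 for a, b in zip(w[l], z)))


def local_vq(w, shard, t0, tau, steps):
    """Run tau sequential VQ iterations on one shard, starting at iteration t0."""
    w = [list(p) for p in w]
    n = len(shard)
    for t in range(t0, t0 + tau):
        z = shard[(t + 1) % n]
        l = nearest(w, z)
        eps = steps(t + 1)
        w[l] = [a - eps * (a - b) for a, b in zip(w[l], z)]
    return w


def averaging_scheme(shards, w0, steps, tau, rounds):
    """Every tau iterations, replace all versions by their average."""
    M = len(shards)
    shared = [list(p) for p in w0]
    for r in range(rounds):
        # Each unit starts the round from the shared version (simulated sequentially).
        versions = [local_vq(shared, shard, r * tau, tau, steps) for shard in shards]
        # Reducing phase: coordinate-wise average of the M versions.
        shared = [[sum(v[l][c] for v in versions) / M
                   for c in range(len(shared[l]))]
                  for l in range(len(shared))]
    return shared
```

A degenerate case makes the behaviour easy to see: if all units hold identical shards, their versions coincide, the average equals any one of them, and M units follow exactly the trajectory of a single unit while processing M times more samples.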

¹ The source code is available at http://code.google.com/p/clouddalvq/. Details
about the artificial data generator are available in Section 4.2 of
http://www.lsta.upmc.fr/doct/patra/publications/PhDMain.pdf
Figure 1: Charts of performance curves for the iterations of the first
distributed scheme (Section 2) with $\tau = 10$ and different numbers of
computing entities: M = 1, 2, 10.



3 Towards a better scheme 

The investigation of the previous non-satisfactory result starts by rewriting
both the sequential and the distributed scheme. Let us first introduce
$H(z, w)$ defined by

\[
H(z, w) = \Bigl( (w_i - z)\, \mathbb{1}_{\{ i = \arg\min_{j=1,\dots,\kappa} \| z - w_j \|^2 \}} \Bigr)_{1 \leq i \leq \kappa}.
\tag{4}
\]

Then, a series of the sequential VQ iterations (1) can be rewritten:

\[
w(t+1) = w(t-\tau+1) - \sum_{t'=t-\tau+1}^{t} \varepsilon_{t'+1}
H\bigl(z_{\{t'+1 \bmod n\}}, w(t')\bigr), \qquad t \geq \tau.
\tag{5}
\]

Then, just after a synchronization (defined by $t \bmod \tau = 0$ and $t > 0$),
for all $i \in \{1, \dots, M\}$, the sequential VQ iterations on each
computational resource can be rewritten as follows:



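\[
w^i(t+1) = w^i(t-\tau+1) - \sum_{t'=t-\tau+1}^{t} \varepsilon_{t'+1}
\Bigl( \frac{1}{M} \sum_{j=1}^{M} H\bigl(z^j_{\{t'+1 \bmod n\}}, w^j(t')\bigr) \Bigr).
\tag{6}
\]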

Assuming that $w^j(t') \approx w^i(t')$ for all $(i, j) \in \{1, \dots, M\}^2$
and $t' > 0$, the mean in parentheses is an estimator of the gradient of the
distortion at $w^i(t')$. Consequently, the two algorithms induced by iterations
(5) and (6) can be thought of as stochastic gradient descent procedures with
different estimators of the gradient but driven by the same learning rate,
which is given by the sequence $(\varepsilon_t)_{t>0}$.

The convergence speed of a non-fixed step gradient descent procedure is 
essentially driven by the decreasing speed of the sequence of steps. The choice 
of this sequence is subject to an exploration/convergence trade-off. Since the 
two procedures above share the same learning rate with respect to the iterations 
t > 0, they share the same convergence speed with respect to the wall clock time 
(time measured by an exterior observer). Yet, the distributed scheme of Section 2
has a much lower learning rate with respect to the number of samples processed,
favoring exploration to the detriment of the convergence. The multiple resources 
therefore lead to better exploration but to similar convergence speed with respect 
to wall clock time. 
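
The loss of learning rate per sample can be made explicit with a one-line expansion. The identity below only redistributes the factor $1/M$ of the averaged step of iterations (6), written in the notation used in the text for $H$, $\varepsilon_t$ and the per-unit samples $z^j$:

```latex
\varepsilon_{t'+1} \Bigl( \frac{1}{M} \sum_{j=1}^{M}
  H\bigl(z^{j}_{\{t'+1 \bmod n\}}, w^{j}(t')\bigr) \Bigr)
= \sum_{j=1}^{M} \frac{\varepsilon_{t'+1}}{M}\,
  H\bigl(z^{j}_{\{t'+1 \bmod n\}}, w^{j}(t')\bigr)
```

Each of the $M$ samples consumed at iteration $t'$ is therefore weighted by $\varepsilon_{t'+1}/M$ instead of $\varepsilon_{t'+1}$: per processed sample, the averaging scheme uses a step $M$ times smaller than the sequential algorithm.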

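As we assume to have a satisfactory VQ implementation, the series of steps
$(\varepsilon_t)_{t>0}$ is supposed to be adapted to the dataset. Consequently,
we should seek a distributed scheme that has the same learning rate evolution
in terms of processed samples and whose convergence speed with respect to
iterations is accelerated. Denote

\[
\Delta^j_{t_1 \to t_2} = \sum_{t'=t_1+1}^{t_2} \varepsilon_{t'+1}
H\bigl(z^j_{\{t'+1 \bmod n\}}, w^j(t')\bigr),
\qquad j \in \{1, \dots, M\} \text{ and } t_2 > t_1 \geq 0.
\tag{7}
\]

At time $t = 0$, $w^1(0) = \dots = w^M(0) = w^{srd}$. For all
$i \in \{1, \dots, M\}$ and all $t \geq 0$, consider the distributed scheme
given by

\[
\begin{cases}
w^i_{temp} = w^i(t) - \varepsilon_{t+1} H\bigl(z^i_{\{t+1 \bmod n\}}, w^i(t)\bigr) \\[4pt]
w^i(t+1) = w^i_{temp} & \text{if } t \bmod \tau \neq 0 \text{ or } t = 0, \\[4pt]
\begin{cases}
w^{srd} = w^{srd} - \sum_{j=1}^{M} \Delta^j_{t-\tau \to t} \\
w^i(t+1) = w^{srd}
\end{cases} & \text{if } t \bmod \tau = 0 \text{ and } t \geq \tau.
\end{cases}
\tag{8}
\]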

The main difference between the two parallel schemes lies in the way results
are merged in the reducing phase (described by the braced inner equations): here
we apply the translation calculated by each parallel VQ to the current shared
version of the prototypes, rather than averaging these translations. The results
of a typical application of this scheme are displayed in the charts of Figure 2.
The charts show that substantial speed-ups are obtained with distributed
resources. The acceleration is greater when the reducing phase is frequent.
Indeed, if $\tau$ is large, more autonomy is granted to the concurrent
executions, which may then be attracted to different regions; this slows down
the consensus and the convergence.
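
The merging rule of this section, in which the displacements of all units are summed and applied to the shared version, can be simulated as follows. This is a sketch under simplifying assumptions rather than the authors' code: `displacement_scheme` and `local_vq` are hypothetical names, units are simulated one after the other with instantaneous communications, and the time indexing is simplified so that each round covers exactly $\tau$ local iterations per unit.

```python
def nearest(w, z):
    """Index of the prototype in w closest to sample z."""
    return min(range(len(w)),
               key=lambda l: sum((a - b) ** 2 for a, b in zip(w[l], z)))


def local_vq(w, shard, t0, tau, steps):
    """Run tau sequential VQ iterations on one shard, starting at iteration t0."""
    w = [list(p) for p in w]
    n = len(shard)
    for t in range(t0, t0 + tau):
        z = shard[(t + 1) % n]
        l = nearest(w, z)
        eps = steps(t + 1)
        w[l] = [a - eps * (a - b) for a, b in zip(w[l], z)]
    return w


def displacement_scheme(shards, w0, steps, tau, rounds):
    """Every tau iterations, subtract the sum of all local displacements
    from the shared version (no averaging)."""
    shared = [list(p) for p in w0]
    for r in range(rounds):
        deltas = []
        for shard in shards:
            local = local_vq(shared, shard, r * tau, tau, steps)
            # Displacement of this unit over the round: start minus end.
            deltas.append([[s - e for s, e in zip(ps, pl)]
                           for ps, pl in zip(shared, local)])
        shared = [[shared[l][c] - sum(d[l][c] for d in deltas)
                   for c in range(len(shared[l]))]
                  for l in range(len(shared))]
    return shared
```

With two units holding the same shard, the shared version moves twice as far per round as with a single unit, which is the intended contrast with the averaging rule, where it would move by the same amount.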



4 A model with stochastic delays 

The previous parallelization schemes do not deal with communication costs 
introduced by update exchanges between machines. In the context of cloud 
computing, no efficient shared memory is available and these costs introduce 
delays. The effect of delays for parallel stochastic gradient descent has already 
been studied (see for instance [4]) but for a computing architecture endowed with 
an efficient shared memory. Moreover, the unreliability of the cloud computing 
hardware introduces strong straggler issues and makes the synchronization process 
inappropriate. In this section, we improve the model of iterations (8) with
random communication costs that follow a geometric distribution and we remove
the synchronization process of the reducing phase, resulting in the more
realistic iterations (9) below.

Figure 2: Charts of performance curves for iterations (8) with $\tau = 10$ and
different numbers of computing entities: M = 1, 2, 10.

For each time $t \geq 0$, let $\tau^i(t)$ be the latest time before $t$ when
the unit $i$ finished sending its updates and received the shared version. At
time $t = 0$ we have $w^1(0) = \dots = w^M(0) = w^{srd}$, and for all
$i \in \{1, \dots, M\}$ and all $t \geq 0$,

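\[
\begin{cases}
w^i_{temp} = w^i(t) - \varepsilon_{t+1} H\bigl(z^i_{\{t+1 \bmod n\}}, w^i(t)\bigr) \\[4pt]
w^i(t+1) = w^i_{temp} & \text{if } t \neq \tau^i(t), \\[4pt]
w^i(t+1) = w^{srd}\bigl(\tau^i(t-1)\bigr) - \Delta^i_{\tau^i(t-1) \to t} & \text{if } t = \tau^i(t), \\[4pt]
w^{srd}(t+1) = w^{srd}(t) - \sum_{j \,:\, t = \tau^j(t)} \Delta^j_{\tau^j(t-1) \to t}.
\end{cases}
\tag{9}
\]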




Figure 3: Charts of performance curves for iterations (9) with $\tau = 10$ and
different numbers of computing entities: M = 1, 2, 10.

There is no longer any synchronization between processing units: each machine
uploads its updates and downloads the shared version as soon as its previous
uploads and downloads are completed. A dedicated unit permanently modifies
the shared version with the latest updates received from the other machines
without any synchronization barrier. Figure 3 shows that the introduction
of small delays and asynchronism only slightly impacts performance, compared
to the scheme given by equations (8). Figure 4 shows the results obtained
by our cloud implementation¹ of the iterations (9) using 32 real processing
units. A future paper will describe this cloud implementation more precisely.
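
The asynchronous behaviour described in this section can be explored with a small discrete-time simulation. The sketch below is an assumption-laden simplification, not the cloud implementation: `async_scheme` and `geometric` are hypothetical names, each exchange lasts a number of ticks drawn from a geometric distribution with parameter `p`, a unit keeps doing local VQ steps while its exchange is in flight, and on completion it restarts from the shared version it snapshotted when the exchange began, corrected by its own displacement since then.

```python
import random


def nearest(w, z):
    """Index of the prototype in w closest to sample z."""
    return min(range(len(w)),
               key=lambda l: sum((a - b) ** 2 for a, b in zip(w[l], z)))


def geometric(p, rng):
    """Number of ticks until the first success, with support {1, 2, ...}."""
    k = 1
    while rng.random() >= p:
        k += 1
    return k


def async_scheme(shards, w0, steps, p, n_ticks, seed=0):
    """Simulate units that exchange with the shared version without any barrier."""
    rng = random.Random(seed)
    K, d = len(w0), len(w0[0])
    shared = [list(q) for q in w0]
    units = [{"shard": shard,
              "w": [list(q) for q in w0],           # local version
              "snapshot": [list(q) for q in w0],    # shared version being downloaded
              "delta": [[0.0] * d for _ in range(K)],  # displacement since snapshot
              "done_at": geometric(p, rng) - 1}     # tick at which the exchange ends
             for shard in shards]
    for t in range(n_ticks):
        for u in units:
            # Local VQ step, always performed, even during communications.
            z = u["shard"][(t + 1) % len(u["shard"])]
            win = nearest(u["w"], z)
            eps = steps(t + 1)
            for c in range(d):
                move = eps * (u["w"][win][c] - z[c])
                u["w"][win][c] -= move
                u["delta"][win][c] += move
            if t == u["done_at"]:
                for l in range(K):
                    for c in range(d):
                        # Upload: the shared version absorbs the displacement.
                        shared[l][c] -= u["delta"][l][c]
                        # Download: delayed snapshot plus own recent displacement.
                        u["w"][l][c] = u["snapshot"][l][c] - u["delta"][l][c]
                # Start the next exchange immediately.
                u["snapshot"] = [list(q) for q in shared]
                u["delta"] = [[0.0] * d for _ in range(K)]
                u["done_at"] = t + geometric(p, rng)
    return shared, [u["w"] for u in units]
```

Setting `p = 1.0` makes every exchange last a single tick; smaller values of `p` give longer and more variable delays, so the units work with older snapshots of the shared version.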
Figure 4: Charts of performance curves for iterations (9) on our cloud
implementation and different numbers of computing entities.



5 Conclusion 

In this paper we show that the naive parallelization scheme proposed in Section 2
does not provide better performance than the sequential scheme. This surprising
result derives from the fact that our first parallel scheme leads to a decrease
of the learning rate per data point processed. We therefore propose a new
parallelization scheme relying on asynchronous updates of a common "shared
version". This latter algorithm is very well suited for parallel computation on
slow communication networks such as cloud computing platforms. Our
implementation on Azure shows significant scale-up, up to 32 machines.